Fast Syntactic Searching in Very Large Corpora for Many Languages

Authors

  • Milos Jakubícek
  • Adam Kilgarriff
  • Diana McCarthy
  • Pavel Rychlý
Abstract

For many linguistic investigations, the first step is to find examples. In the 21st century, they should all be found, not invented. Thus linguists need flexible tools for finding even quite rare phenomena. To support linguists well, these tools need to be fast even where corpora are very large and queries are complex. We present extensions to CQL (Corpus Query Language) for intuitive creation of syntactically rich queries, and demonstrate that they can be computed quickly within our tool even on multi-billion-word corpora.

Similar resources

Searching the Annotated Portuguese Childes Corpora

Recently there has been a growing number of initiatives for annotating children’s data for a number of languages with, for instance, part-of-speech (PoS) and syntactic information (Sagae et al., 2010; Buttery and Korhonen, 2007; Yang, 2010), and some of these are available as part of CHILDES (MacWhinney, 2000). For resource-rich languages like English these annotations can be further extended wit...

Measuring the Divergence of Dependency Structures Cross-Linguistically to Improve Syntactic Projection Algorithms

Syntactic parses can provide valuable information for many NLP tasks, such as machine translation, semantic analysis, etc. However, most of the world’s languages do not have large amounts of syntactically annotated corpora available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora between resource-poor and resource-rich languages, boo...

Harmonised large-scale syntactic/semantic lexicons: a European multilingual infrastructure

The paper aims to provide an overview of the situation of Language Resources (LR) in Europe, in particular as emerging from a few European projects on the construction of large-scale harmonised resources to be used for many application purposes, including multilingual ones. An important research aspect of the projects is given by the very fact that the large enterprise described is, at ...

A New Approach to Tagging in Indian Languages

In this paper, we present a new approach to automatic tagging that requires no machine learning algorithm or training data. We argue that the critical information required for tagging comes more from word-internal structure than from the context, and we show how a well-designed morphological analyzer can assign correct tags and also resolve many cases of tag ambiguity. The crux of the...

Parsed Corpora for Linguistics

Knowledge-based parsers are now accurate, fast and robust enough to be used to obtain syntactic annotations for very large corpora fully automatically. We argue that such parsed corpora are an interesting new resource for linguists. The argument is illustrated by means of a number of recent results which were established with the help of parsed corpora.

Publication date: 2010